Determining Redundancies in Content Object Directories

This invention relates generally to fingerprinting files for identification. More
specifically, this invention relates to determining redundancies in file directories.

BACKGROUND

[01] One of the drawbacks to computer systems is the vast number of redundant files that
are repeatedly copied and stored in multiple directories. While attempts are made to identify
these files by a unique name or characteristic, what often results is that redundant files are
saved multiple times in a directory or computer system. As a result, memory is wasted in
storing the redundant files. Furthermore, it is not uncommon for files to be misidentified,
either innocently or intentionally, in a computing system. As a result, files residing on the
system may have an incorrect identifier that prevents them from being correctly or
efficiently recognized by various users or application programs.

[02] As one example, in the industry of downloading music files across the internet, it is
not uncommon for a new artist to store a new song under the name of a popular artist.
The theory is that by storing the new song under the name of the more popular artist, more
people will likely download that particular file and listen to the misidentified song of the new
artist. This is commonly referred to as "Napster bombing". Apparently, the new artists feel
that by Napster bombing there is a greater chance of being discovered by the listening public.
In peer-to-peer networks, for example, one can access the directory of another user and view
the available files of that user. Thus, the user who controls the physical directory can
misidentify songs either inadvertently or purposefully. A Napster bomber involved in a
peer-to-peer network connection with another user can misidentify his or her new song and
allow a second user to download that file for listening. Thus, the second user can waste a
good deal of time in obtaining a copy of a song that was misidentified.

[03] As another example, a memory system that contains redundant data can waste storage
space that could be better used for nonredundant data. For example, as files are copied and
stored during normal processes, they are given new names by users for easier identification.
As a result, multiple files that contain the same data are stored on a computer system. Days,
months, or years later, it is difficult to know from the file characteristics or identifiers, such
as file names, whether the files are redundant or not. Thus, they are simply maintained on the
computing system by the housekeeping programs.



[04] With the advent of downloading audio and video files across computer networks for
viewing by users on their home computers, there is a great potential for not only storing
redundant files but also Napster bombing video files. As a result, a user could potentially
waste a good deal of time, for example, in downloading a misidentified video file, which takes
substantially longer to download than a less memory intensive audio file. Furthermore, the
servers or caching computers that will store data or content files such as video files will have
limited memory capacity for storage. Thus, it would be desirable to be able to eliminate any
unnecessary redundant files.

SUMMARY

[05] One embodiment of the invention provides a system for eliminating redundant files
stored in a computer directory. This embodiment of the invention can be accomplished by
accessing multiple files stored on memory, wherein each of the files is configured to be
identified by a fingerprint; determining a fingerprint for each of the files stored on the
memory; establishing a standard, such as a redundancy standard, to indicate when any two
fingerprints are redundant; comparing the fingerprints determined for each of the files; and
determining which files are redundant based upon the comparison.

[06] Redundant files can also be removed or deleted from the memory in one aspect.
Furthermore, various types of fingerprints could be utilized, such as utilizing a Fast Fourier
Transform (FFT) as the fingerprint, utilizing a watermark as the fingerprint, or utilizing a CRC
as the fingerprint.

[07] In one embodiment, the system can be utilized to access various file formats such as 
audio files or video files. 

[08] In another embodiment, an identifier for a file can be provided by accessing the file;
deriving a frequency representation of the file; providing a file name for the file; providing
the file name in a directory; and associating the frequency representation of the file with the
file name so that the frequency representation is accessible via the directory.

[09] Again, in various embodiments of the invention a Fourier Transform could be used,
an FFT could be used, or a Discrete Fourier Transform (DFT) could be used. Furthermore,
the frequency representation could be included as metadata in an address listing.

[10] In another embodiment of the invention a method of searching for a file can be
utilized by obtaining a first frequency representation of a desired file; accessing a first
unknown file; obtaining a second frequency representation of the unknown file; comparing
the first frequency representation of the desired file with the second frequency representation
of the unknown file; and determining from the comparison whether the unknown file is the
desired file.

[11] Furthermore, in various aspects of this embodiment, the frequency representation can
be obtained by different algorithms. For example, it could be performed utilizing an FFT, a
Discrete Fourier Transform (DFT), or the like.

[12] In another aspect of this embodiment, frequency comparisons can be performed by 
comparing a range of frequencies of the first and second frequency representations so as to 
determine whether they are equivalent. 

[13] Furthermore, this embodiment can utilize a decoder to decode a file prior to obtaining
the frequency representation for that file.
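
The decoding step of paragraph [13] can be sketched as follows. This is a minimal illustration only: it assumes the file is an uncompressed 16-bit PCM WAV file, which stands in here for whatever decoder a real audio or video format would need, and the function name decode_wav is hypothetical.

```python
import struct
import wave

def decode_wav(path, max_frames=None):
    """Decode 16-bit PCM WAV data to a list of floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit samples")
        channels = wav.getnchannels()
        total = wav.getnframes()
        frames = total if max_frames is None else min(max_frames, total)
        raw = wav.readframes(frames)
    # Samples are little-endian signed 16-bit integers, interleaved by channel.
    values = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Keep the first channel only so the result is a mono sample sequence.
    return [value / 32768.0 for value in values[::channels]]
```

The decoded samples, rather than the encoded bytes, would then be passed to whichever frequency transform is chosen.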

BRIEF DESCRIPTION OF THE DRAWINGS 
[14] Fig. 1 provides a flowchart illustrating a method for removing redundant files
according to one embodiment of the invention.

[15] Fig. 2 provides a flowchart illustrating a method for identifying files in a directory
according to one embodiment of the invention.

[16] Fig. 3 provides a flowchart illustrating a method of identifying an unknown file
according to one embodiment of the invention.

[17] Fig. 4 illustrates a system for accomplishing the methods illustrated in Figs. 1, 2 and
3.

[18] Fig. 5 illustrates a system for accomplishing the components shown in Fig. 4.

DESCRIPTION

[19] With the advent of downloading audio and especially video files across computer
networks, it is ever more important to be able to correctly identify a proper file. Namely, a
great deal of computing resources is required in the downloading of such files, as they are
very memory intensive and not only require a good deal of time to download but
consequently occupy a good deal of computing bandwidth. Thus, it is inefficient to
download files that have been misidentified and do not serve the purpose of the user who
requests the misidentified file. Furthermore, the storage space on computing systems is an
ever present problem and it is beneficial when redundant files can be identified and removed
from a computer's memory to create additional storage space.

[20] In one embodiment of the invention, a method is provided to identify redundant files
in a computing system. Such a method can be useful in identifying files on a user's own
computer as well as identifying files on the computer of another. In Fig. 1, a method 100 is
illustrated in which, according to block 104, multiple files stored on a computer's
memory are accessed. For example, the files on a user's own computer can be accessed.
According to block 108, a fingerprint for each of the files is then determined.
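
Blocks 104 and 108 can be sketched as a directory walk that fingerprints every file it finds. The sketch below is illustrative only: it assumes a CRC-32 over the raw file bytes as the fingerprint (one of the fingerprint types named in the summary), and the function names are hypothetical.

```python
import os
import zlib

def fingerprint_file(path, chunk_size=65536):
    """Return a CRC-32 of the file contents (one possible fingerprint)."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in pieces.
            crc = zlib.crc32(chunk, crc)
    return crc

def fingerprint_directory(root):
    """Map every file path under root to its fingerprint (blocks 104 and 108)."""
    fingerprints = {}
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            fingerprints[path] = fingerprint_file(path)
    return fingerprints
```

Because the fingerprint depends only on file contents, two copies saved under different names map to the same value.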

[21] The determination of a fingerprint can take a variety of forms. As one example, an
FFT can be utilized. The FFT can be generated according to any commercially available
program or chip for computing FFTs such as FFTW version 2.1.3 developed at MIT by
Matteo Frigo and Steven G. Johnson and currently available for free at the FFTW.ORG
website. (The algorithms for computing Fourier Transforms, Discrete Fourier Transforms
and Fast Fourier Transforms are presented, for example, in Signals and Systems, by
Oppenheim and Willsky, Prentice Hall 1983.)

[22] With an FFT, for example, the audio characteristics of a song could be sampled and an
FFT could be generated for that particular song. Thus, an FFT characteristic of that file could
be generated. The FFT characteristics will vary depending on the portion of the file that is
utilized to generate the FFT. Furthermore, the length of the segment of the song that is
utilized can impact the resulting FFT.
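
As a rough illustration of paragraph [22], the sketch below computes a magnitude spectrum for one segment of a sample sequence. It is not the FFTW routine mentioned above: it uses a small textbook radix-2 FFT written in plain Python, assumes the segment length is a power of two, and normalizes the spectrum to its peak, all of which are choices made for this example only.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft_fingerprint(samples, start, length):
    """Magnitude spectrum of one segment, normalized so the peak is 1.0."""
    segment = samples[start:start + length]
    spectrum = fft(segment)
    # For real input the upper half mirrors the lower half, so keep one half.
    magnitudes = [abs(value) for value in spectrum[: length // 2]]
    peak = max(magnitudes) or 1.0
    return [m / peak for m in magnitudes]

# Usage: a 64-sample tone that completes 8 cycles peaks in bin 8.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
fingerprint = fft_fingerprint(tone, 0, 64)
```

Changing the start or length arguments selects a different portion of the file and generally yields a different fingerprint, which is why both files must be sampled the same way before their fingerprints are compared.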

[23] As another example, a watermark could be utilized as the fingerprint for a file.
Namely, a watermark that is placed on a file could not only serve to identify whether the file
is authentic, but could also be utilized to identify the characteristics of that file.
Furthermore, a CRC could be generated for a particular file so as to derive a unique identifier
for that file.

[24] In block 112, a redundancy standard is established so as to indicate when two files are
redundant of one another. For example, in the case of an FFT, the requirements for sampling
a file could state that the first five minutes of playing time of the file are
sampled at a specific sampling rate. In addition, the resulting frequency histograms of the two
files can be allowed to vary from one another by a predetermined percentage. For example, if
a histogram for file A is generated and a histogram for file B is generated,
a common pattern of the histograms may vary, for example, by five percent, and the files may
still be considered redundant. The various characteristics that are utilized for determining
whether files are redundant can be selected by the user. For example, stricter requirements
could be utilized, such as an exact match between fingerprints of two files in determining
whether they are redundant of one another.
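
One way to express such a redundancy standard in code is sketched below. The specification does not define how the five percent variation is measured, so this example assumes a per-bin relative difference between two equal-length histograms; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RedundancyStandard:
    """User-selectable criteria; field names are illustrative."""
    tolerance: float = 0.05   # allowed relative variation per histogram bin
    exact: bool = False       # stricter option: require identical fingerprints

def is_redundant(hist_a, hist_b, standard):
    """Apply the standard to two frequency histograms of equal length."""
    if len(hist_a) != len(hist_b):
        return False
    if standard.exact:
        return list(hist_a) == list(hist_b)
    for a, b in zip(hist_a, hist_b):
        scale = max(abs(a), abs(b))
        # Bins that are zero in both histograms always agree.
        if scale and abs(a - b) / scale > standard.tolerance:
            return False
    return True
```

Under these assumptions, is_redundant([1.0, 0.5], [0.97, 0.51], RedundancyStandard()) accepts the pair, while the same call with RedundancyStandard(exact=True) rejects it.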

[25] In block 116 the fingerprints of at least some of the files can be compared. Thus, as
illustrated in block 120 of Fig. 1, a determination can be made based on the comparison in
block 116 as to whether any two files are redundant of one another. The redundancy standard
that was established can be utilized to provide criteria for determining in the comparison
whether the two fingerprints satisfy the criteria of the established redundancy standard.
Alternatively, even without an established standard, the results of a comparison could be
displayed for viewing by a user to allow the user to determine whether the fingerprints are
sufficiently similar to be considered redundant files.
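
Blocks 116 and 120 amount to a pairwise comparison of fingerprints. The following sketch assumes the fingerprints are held in a dictionary keyed by file path and that the redundancy standard is supplied as a callable; both are assumptions of this example, not requirements of the method.

```python
from itertools import combinations

def find_redundant_pairs(fingerprints, matches):
    """Return (path_a, path_b) for every pair whose fingerprints match.

    fingerprints: dict mapping a file path to its fingerprint.
    matches: callable implementing the redundancy standard.
    """
    pairs = []
    # Sorting by path makes the order of the reported pairs deterministic.
    ordered = sorted(fingerprints.items())
    for (path_a, fp_a), (path_b, fp_b) in combinations(ordered, 2):
        if matches(fp_a, fp_b):
            pairs.append((path_a, path_b))
    return pairs
```

The returned list could either drive automatic removal or be displayed so that a user makes the final determination. Comparing every pair grows quadratically with the number of files, which is acceptable for a sketch but may need grouping or indexing on a large system.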

[26] In block 124 any files that have been determined to be redundant can be removed.
For example, a list of files and file characteristics could be displayed for viewing by the user
showing which files are redundant of one another. Thus, the user could make the final
determination as to whether to remove files or keep them on the file system. It is envisioned
that in most instances, such redundant files will simply be deleted from the memory and file
directories of the computer system so as to free space for use by new files. Alternatively,
some files may be retained even though they satisfied the redundancy standard.
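
Block 124 could be sketched as below. The example assumes the redundant files arrive as (keep, drop) path pairs and that the user's final determination is modeled by a confirm callback; these names and the policy of keeping the first file of each pair are illustrative choices.

```python
import os

def remove_redundant(pairs, confirm=lambda keep, drop: True):
    """Delete the second file of each redundant pair when confirm() agrees.

    Returns the list of paths that were removed.
    """
    removed = []
    for keep, drop in pairs:
        if keep in removed or drop in removed:
            continue  # one of the files is already gone
        if os.path.exists(drop) and confirm(keep, drop):
            os.remove(drop)
            removed.append(drop)
    return removed
```

Passing a confirm callback that returns False for selected files lets some files be retained even though they satisfied the redundancy standard.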

[27] The method of determining redundant files lends itself for use with any number of
data files that can be fingerprinted. For example, audio files can easily be fingerprinted
utilizing an FFT algorithm. Similarly, video files could be fingerprinted with an FFT
algorithm. In the case of an audio file, a redundancy standard could be established by
establishing a range of frequencies identified in the FFT and the percentage of the level of
those frequencies that must match.
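
A frequency-range standard of this kind might look like the following sketch. It assumes both spectra are normalized magnitude lists of equal length, treats a bin as matching when the two levels differ by no more than an absolute tolerance, and requires a minimum fraction of matching bins; the parameter names and default values are hypothetical.

```python
def band_match(spectrum_a, spectrum_b, low_bin, high_bin,
               level_tolerance=0.05, required_fraction=0.9):
    """True when enough bins in [low_bin, high_bin) have matching levels."""
    band_a = spectrum_a[low_bin:high_bin]
    band_b = spectrum_b[low_bin:high_bin]
    if not band_a or len(band_a) != len(band_b):
        return False
    matching = sum(
        1 for a, b in zip(band_a, band_b) if abs(a - b) <= level_tolerance
    )
    return matching / len(band_a) >= required_fraction
```

Restricting the comparison to a band lets the standard ignore frequencies where two encodings of the same song are expected to differ.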

[28] Once a fingerprint has been generated, it can be retained and appended as metadata to
the file indicator. Thus, it could be associated with a file name in a file directory.
Consequently, one could click on an FFT indicator next to a file name in a Microsoft
Windows file directory to bring up an FFT fingerprint for that file. This would simply involve
linking the FFT data to the file name in the file directory. Thus, the fingerprint could be
stored with the file in a database. As a result of this association, the fingerprint and file name
or other identifier can be cataloged in a database. For example, if a database of video files is
created by an entity on the internet, that entity could create a master database of content
objects offered for streaming to viewing customers. Thus, the entity could distribute content
object files across its system and associate a fingerprint with each of the files. The master
database could retain a fingerprint for each file as well and utilize the fingerprint for
housekeeping functions. For example, such housekeeping functions could be performed on
remote databases such as caching servers to remove any redundant content files stored on the
caching servers.
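
The cataloging described in paragraph [28] can be illustrated with a small database sketch. It assumes SQLite as the store and a text fingerprint compared by exact equality; the table layout and function names are hypothetical and are not taken from the specification.

```python
import sqlite3

def open_catalog(path=":memory:"):
    """Create (or open) a catalog that links file names to fingerprints."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS catalog ("
        " file_name TEXT PRIMARY KEY,"
        " fingerprint TEXT NOT NULL)"
    )
    return db

def register(db, file_name, fingerprint):
    """Associate a fingerprint with a file name, replacing any earlier entry."""
    db.execute(
        "INSERT OR REPLACE INTO catalog (file_name, fingerprint) VALUES (?, ?)",
        (file_name, fingerprint),
    )

def redundant_groups(db):
    """Housekeeping query: fingerprints shared by more than one file name."""
    groups = {}
    rows = db.execute(
        "SELECT fingerprint, file_name FROM catalog ORDER BY file_name"
    )
    for fingerprint, file_name in rows:
        groups.setdefault(fingerprint, []).append(file_name)
    return {fp: names for fp, names in groups.items() if len(names) > 1}
```

A master database built this way could run redundant_groups periodically and pass the resulting file names to whatever removal step is used on the caching servers.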

[29] In Fig. 2, an embodiment of the invention for identifying a file is described. In
flowchart 200 of Fig. 2, a file is accessed as shown by block 204. For example, the file might
simply be accessed from a file directory utilizing its file name as an identifier. A frequency
representation of the file can then be derived as indicated in block 208. As one example, an
FFT could be generated utilizing the file data so as to generate FFT data for that file. In
block 212 a file name is provided for the file. Typically, this will simply involve using the
same name that was previously used to access the file. However, one could easily provide a
new name for the file. In block 216 the file name can be disposed in a directory for the file.
Again, this could simply involve saving the file under a new name in a directory. In addition,
as shown in block 220, the frequency representation of the file is associated with the file
name. As noted above, this could occur by linking the frequency representation data
generated by an FFT with the file name given in the file directory. Thus, anyone wishing to
view further identifiers of a particular file that has an associated frequency representation can
pull up the FFT data or other frequency representation data and view the frequency
characteristics for that file. In block 224 the frequency representation could be summarized
with an indicator that is displayed as part of a file descriptor. For example, a file description
in an internet address could be provided in which frequency data is included as metadata as
part of that file address.
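
Block 224 could be sketched as follows. The specification does not say how the frequency data is summarized or encoded in an address, so this example assumes the indicator is the set of strongest spectrum bins and that it travels in a query parameter named fft; the parameter name, the example.com address, and the function names are all hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def summarize(spectrum, bins=4):
    """Reduce a magnitude spectrum to a short indicator: the strongest bins."""
    ranked = sorted(range(len(spectrum)), key=lambda i: spectrum[i], reverse=True)
    return sorted(ranked[:bins])

def address_with_fingerprint(base_url, spectrum):
    """Append the indicator to an address that has no query string yet."""
    indicator = "-".join(str(i) for i in summarize(spectrum))
    return base_url + "?" + urlencode({"fft": indicator})

def fingerprint_from_address(url):
    """Recover the indicator from an address built by address_with_fingerprint."""
    value = parse_qs(urlsplit(url).query)["fft"][0]
    return [int(part) for part in value.split("-")]

# Usage with a made-up eight-bin spectrum.
spectrum = [0.1, 0.9, 0.2, 0.8, 0.0, 0.7, 0.3, 0.6]
url = address_with_fingerprint("http://example.com/song.mp3", spectrum)
```

A client that reads the address can then compare the indicator with its own before deciding whether to download the file.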

[30] Fig. 3 illustrates an embodiment of the invention for identifying an unknown file. In
some instances, it is desirable to take a known file and search for other occurrences of that
known file. Typically, this is accomplished utilizing a file name associated with file data and
searching based on that file name. There is an assumption that the file name has not been
mislabeled for that file data. As noted above in the case of Napster bombing, this is an
incorrect assumption, as file names are often purposely listed incorrectly so as to dupe the
listening public into downloading a new artist's song. Thus, this embodiment of the
invention can be utilized to search based on frequency characteristics of a song as opposed to
a simple file name, which can easily be corrupted.

[31] In Fig. 3, a method 300 is shown illustrating an embodiment of the invention for
identifying an unknown file. In block 304, a desired file is obtained and a first frequency
representation of that file is generated. For example, this can be performed by obtaining a
video file and generating an FFT based on the first five minutes of the audio portion of that
video file. Alternatively, the FFT could be performed on the video portion of the video file.
Thus, a frequency representation indicative of that video file would be generated as a
fingerprint. In block 308, an unknown file in a file directory is accessed. For example, a
caching server which stores multiple files for downloading by commercial customers in a
video distribution network could be accessed and a first unknown file from that caching
server obtained. In block 312, a frequency representation of this unknown file is generated so
as to produce a second frequency representation for comparison to the frequency
representation earlier generated for the desired file. Consequently, in block 316 the
first frequency representation of the desired file is compared with the
second frequency representation of the unknown file. In block 320 a determination is made
from this comparison as to whether the unknown file is equivalent to the desired file. A
predetermined standard could be utilized to make this determination. It is envisioned that a
user could be given a template of characteristics to choose from in determining the criteria to
be used in the standard. Alternatively, the standard could be predefined by a standardizing
body which established criteria for determining when two files are equivalent. Once the
determination is made, a user could act upon the conclusion, such as deleting a redundant file
or performing further comparisons to determine whether additional portions, such as the
entire data file, are equivalent. Thus, a program could be devised to compare initial portions
of fingerprints for eliminating files that are clearly dissimilar and then through a repetitive
process determining those files that are actual equivalents based upon subsequent
comparisons.
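
The repetitive comparison described at the end of paragraph [31] can be sketched as a staged match that looks at progressively longer portions of two fingerprints and stops at the first clear mismatch. The stage lengths, the absolute tolerance, and the function name are assumptions made for this example.

```python
def progressive_match(desired, unknown, stages=(8, 64, None), tolerance=0.05):
    """Compare fingerprint prefixes of growing length; stop at first mismatch.

    desired, unknown: sequences of numbers (for example FFT magnitudes).
    stages: prefix lengths to compare in order; None means the whole fingerprint.
    Returns (is_equivalent, number_of_stages_run).
    """
    if len(desired) != len(unknown):
        return False, 0
    for count, stage in enumerate(stages, start=1):
        end = len(desired) if stage is None else min(stage, len(desired))
        for a, b in zip(desired[:end], unknown[:end]):
            if abs(a - b) > tolerance:
                return False, count
    return True, len(stages)
```

Files that differ near the start are rejected after the cheap first stage, so the full comparison is spent only on candidates that have already survived the earlier stages. For simplicity each stage here rechecks the earlier prefix; a production version would likely compare only the newly added portion.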

[32] Fig. 4 illustrates an embodiment of the invention that could be used for accomplishing
the methods of Figs. 1, 2, and 3. In Fig. 4 a system 400 is illustrated in which a network,
such as the internet 416, is shown. In Fig. 4, a first computer 412 housing a database of files
is shown. Furthermore, a second computer 404 is shown which can communicate through the
network 416 with first computer 412. In addition, a third computer 408 is shown which can
communicate with both computers 404 and 412 via the network 416. As but one example,
this could embody a video streaming network in which computer 404 serves as a caching
server for maintaining copies of video files that are distributed to customers, such as a
customer using third computer 408. A master computer or master database could be
implemented with first computer 412. Thus, from time to time, the master computer will
desire to eliminate redundant files that are stored on its own database in first computer 412 as
well as on a remote computer such as caching server 404. With this system, peer-to-peer
communications could be established between the three computers.

[33] Fig. 5 broadly illustrates how individual system elements can be implemented in a
separated or more integrated manner within various, generally similarly configured
processing systems. System 500 is shown comprised of hardware elements that are
electrically coupled via bus 508, including a processor 501, input device 502, output device
503, storage device 504, computer-readable storage media reader 505a, communications
system 506, processing acceleration (e.g., DSP or special-purpose processors) 507 and
memory 509. Computer-readable storage media reader 505a is further connected to
computer-readable storage media 505b, the combination comprehensively representing
remote, local, fixed and/or removable storage devices plus storage media, memory, etc. for
temporarily and/or more permanently containing computer-readable information, which can
include storage device 504, memory 509 and/or any other such accessible system 500
resource. System 500 also comprises software elements (shown as being currently located
within working memory 591) including an operating system 592 and other code 593, such as
programs, applets, data and the like.

[34] System 500 is desirable as an implementation alternative largely due to its extensive 
flexibility and configurability. Thus, for example, a single architecture might be utilized to 
implement one or more servers that can be further configured in accordance with currently 
desirable protocols, protocol variations, extensions, etc. However, it will be apparent to those 
skilled in the art that substantial variations may well be utilized in accordance with more 
specific application requirements. For example, one or more elements might be implemented 
as sub-elements within a system 500 component (e.g. within communications system 506). 
Customized hardware might also be utilized and/or particular elements might be implemented 
in hardware, software (including so-called "portable software," such as applets) or both. 
Further, while connection to other computing devices such as network input/output devices
(not shown) may be employed, it is to be understood that wired, wireless, modem and/or 
other connection or connections to other computing devices might also be utilized. 
Distributed processing, multiple site viewing, information forwarding, collaboration, remote 
information retrieval and merging, and related capabilities are each contemplated. Operating 
system utilization will also vary depending on the particular host devices and/or process types 
(e.g. computer, appliance, portable device, etc.) and certainly not all system 500 components 
will be required in all cases. 

[35] While various embodiments of the invention have been described as methods or
apparatus for implementing the invention, it should be understood that the invention can be
implemented through code coupled to a computer, e.g., code resident on a computer or
accessible by the computer. For example, software and databases could be utilized to
implement many of the methods discussed above. Thus, in addition to embodiments where
the invention is accomplished by hardware, it is also noted that these embodiments can be
accomplished through the use of an article of manufacture comprised of a computer usable
medium having a computer readable program code embodied therein, which causes the
enablement of the functions disclosed in this description. Therefore, it is desired that
embodiments of the invention also be considered protected by this patent in their program
code means as well.

[36] It is also envisioned that embodiments of the invention could be accomplished as 
computer signals embodied in a carrier wave, as well as signals (e.g., electrical and optical) 
propagated through a transmission medium. Thus, the various information discussed above 
could be formatted in a structure, such as a data structure, and transmitted as an electrical 
signal through a transmission medium or stored on a computer readable medium.

[37] It is also noted that many of the structures, materials, and acts recited herein can be
recited as means for performing a function or steps for performing a function. Therefore, it
should be understood that such language is entitled to cover all such structures, materials, or
acts disclosed within this specification and their equivalents.

[38] It is thought that the apparatuses and methods of the embodiments of the present
invention and many of its attendant advantages will be understood from this specification and
it will be apparent that various changes may be made in the form, construction, and
arrangement of the parts thereof without departing from the spirit and scope of the invention
or sacrificing all of its material advantages, the form herein before described being merely
exemplary embodiments thereof.